
fix: Publisher-mode synchronization option for failover scenario #3222

Open
alpe wants to merge 4 commits into main from alex/sync_race

Conversation

@alpe
Contributor

@alpe alpe commented Mar 31, 2026

Overview

E2E HA tests sometimes fail on a race when the leader waits for P2P sync to complete on a fresh start.

Summary by CodeRabbit

  • New Features

    • Publisher-mode sync startup for early P2P readiness and failover publishing; "start for publishing" option to start sync services without full P2P init.
  • Documentation

    • Updated Raft bootstrap/startup docs and CLI flag guidance; bootstrap now documented as compatibility-only and startup is auto-selected.
  • Improvements

    • Hardened Raft startup, leader election, failover recovery, and sync startup/retry/logging behavior.
  • Tests

    • Expanded unit and E2E coverage for raft startup, failover, sync publishing, and logging.
  • Chores

    • Updated changelog with unreleased publisher-mode entry.

@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

Adds a publisher-mode startup path for sync services during Raft failover, refactors syncer lifecycle and status semantics, adjusts Raft startup/election behavior, updates E2E P2P identity and process logging, enables local module replace directives, and updates docs/config and changelog.

Changes

Cohort / File(s) Summary
Publisher-mode startup & failover
node/failover.go, test/e2e/failover_e2e_test.go
Add publisher-mode decision for raft leaders with empty stores; start header/data sync via StartForPublishing; log store/header/data heights and warn on unknown heights; adjust E2E wiring for P2P addresses.
Sync service API & lifecycle
pkg/sync/sync_service.go, pkg/sync/sync_service_test.go, pkg/sync/syncer_status.go, pkg/sync/syncer_status_test.go
Add StartForPublishing, refactor startSyncer to return (bool,error), move syncer setup into prepareStart, replace atomic started flag with mutex-protected started and startOnce/stopIfStarted, update/add tests (publishing + concurrency).
Block sync validation & recovery
block/internal/syncing/syncer.go, block/internal/syncing/syncer_test.go
Extract trySyncNextBlockWithState(...) and use explicit currentState during recovery paths, set InitialHeight on genesis fallback, add Raft recovery tests (bootstrap when uninitialized, chain ID validation).
Raft startup & election behavior
pkg/raft/node.go, pkg/raft/node_test.go, pkg/raft/election.go, pkg/raft/election_test.go
Auto-select startup mode based on persisted Raft configuration (de-emphasize Bootstrap flag), refactor follower start/verify with polling, attempt recovery on sync-check failures, add nil-receiver test and adjust mocks.
E2E infra & process helpers
test/e2e/sut_helper.go, test/e2e/failover_e2e_test.go
Per-node libp2p identity/multiaddr support (separate listen vs peer addresses), improved leader detection tolerance, capture cmd.Wait() errors in process cleanup logs, conditional log-dir creation and simplified log filenames, safer tickers/timeouts.
Docs & config strings
docs/guides/raft_production.md, docs/learn/config.md, pkg/config/config.go, pkg/config/config_test.go, CHANGELOG.md
Document Raft startup-mode auto-selection and legacy raft.bootstrap description, update example flag namespaces, add raft flag assertions in tests, and add changelog entry for publisher-mode sync option.
Module replace activation
apps/evm/go.mod
Uncomment local replace directives for github.com/evstack/ev-node and .../execution/evm to point to local relative paths.

Sequence Diagram(s)

sequenceDiagram
    participant Failover as Failover Manager
    participant Raft as Raft Node
    participant Store as Block Store
    participant Sync as Sync Service
    participant P2P as P2P Init

    Note over Failover,Raft: Startup Run()

    Failover->>Raft: Is leader? / get raft config
    alt Leader & persisted raft config present
        Raft-->>Failover: leader=true
        Failover->>Store: query header/data heights
        Store-->>Failover: heights empty
        Failover->>Sync: StartForPublishing(ctx)
        Sync->>Sync: prepareStart (no initFromP2PWithRetry)
        Sync->>P2P: startSubscriber
        Sync-->>Failover: started (publishing)
    else Not publishing
        Failover->>Sync: Start(ctx)
        Sync->>P2P: initFromP2PWithRetry (peer discovery)
        Sync->>Sync: startSubscriber
        Sync-->>Failover: started (normal)
    end
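The alt-branch decision in the diagram can be sketched as a minimal gate; names, signature, and the inclusion of the main store height are illustrative assumptions (they follow the review discussion in this thread, not the PR's actual API): only a raft leader whose block store and sync stores are all empty takes the publishing path.

```go
package main

import "fmt"

// shouldStartSyncInPublisherMode sketches the gate: a fresh raft leader with
// no persisted blocks publishes first; everyone else does normal P2P sync.
// This is a hypothetical illustration, not the PR's implementation.
func shouldStartSyncInPublisherMode(isLeader bool, storeHeight, headerHeight, dataHeight uint64) bool {
	if !isLeader {
		return false
	}
	// Any persisted height means the node must catch up via normal P2P sync.
	return storeHeight == 0 && headerHeight == 0 && dataHeight == 0
}

func main() {
	fmt.Println(shouldStartSyncInPublisherMode(true, 0, 0, 0))  // fresh leader: publish
	fmt.Println(shouldStartSyncInPublisherMode(true, 5, 0, 0))  // has blocks: normal sync
	fmt.Println(shouldStartSyncInPublisherMode(false, 0, 0, 0)) // follower: normal sync
}
```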

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • julienrbrt
  • tac0turtle
  • chatton

Poem

🐇 I hop with a ledger and tiny drum,
Publisher mode wakes before peers come,
Raft nods and the syncer takes flight,
Mutex guards the start through day and night,
Early P2P dawn—hop, data hums.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 23.81%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check (❓ Inconclusive): the description identifies the problem (race condition in E2E HA tests) but lacks implementation details, rationale, and a linked issue. Resolution: expand the description with (1) why publisher-mode solves the race, (2) how it works, (3) a link to the related issue (#3222), and (4) any testing/rollout considerations.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): the title accurately reflects the main change: adding publisher-mode synchronization for failover scenarios.


@github-actions
Contributor

github-actions bot commented Mar 31, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 2, 2026, 11:46 AM

@claude
Contributor

claude bot commented Mar 31, 2026

Claude finished @alpe's task in 2m 48s.


Code Review

  • Gather context and understand the PR
  • Review core sync service changes (pkg/sync/)
  • Review failover logic (node/failover.go)
  • Review raft changes (pkg/raft/)
  • Review block syncing changes (block/internal/syncing/syncer.go)
  • Review tests and E2E changes
  • Review docs and config changes
  • Post final review

Overall this is a well-motivated fix for a real race condition. The publisher-mode approach is a clean solution — skipping initFromP2PWithRetry on a fresh raft leader avoids the deadlock while keeping the P2P substrate ready for the first broadcast. There are a few correctness issues below that I'd address before merge.

Critical / Bug

Stop() panics if Start/StartForPublishing was never completed (pkg/sync/sync_service.go:450)

topicSubscription is set only inside startSubscriber. If prepareStart or initFromP2PWithRetry returns an error, the field stays nil. Stop() then unconditionally calls syncService.topicSubscription.Cancel() — a nil-pointer panic.

In failover.go the defer is registered only after syncWg.Wait() succeeds, so the current happy/error path is safe. But the Stop() method is part of a public interface and callers can't be expected to know this invariant. A nil guard (or initialising topicSubscription to a no-op value) is the minimal fix:

if syncService.topicSubscription != nil {
    syncService.topicSubscription.Cancel()
}



Major

shouldStartSyncInPublisherMode reads storeHeight but never uses it (node/failover.go:193-209)

storeHeight is fetched and logged but absent from the gate condition. A node whose main store already contains blocks (sync stores empty, e.g. after a crash and partial restart) would still take the StartForPublishing path and skip the blocking P2P sync that a non-fresh leader needs:

// current — storeHeight ignored
if headerHeight > 0 || dataHeight > 0 {
    return false
}

// should be
if storeHeight > 0 || headerHeight > 0 || dataHeight > 0 {
    return false
}


apps/evm/go.mod — repo-local replace directives committed (apps/evm/go.mod:5-8)

replace (
    github.com/evstack/ev-node => ../../
    github.com/evstack/ev-node/execution/evm => ../../execution/evm
)

These relative paths are only valid inside this monorepo. Once apps/evm is published as a module, any downstream consumer that adds it as a dependency will get a broken go.mod. Local dev linking should use go.work instead and these directives should be removed before merge.
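A go.work file at the repository root achieves the same local linking without touching the published manifest. A sketch, assuming the module layout shown in this PR (the Go version line is illustrative):

```
go 1.23

use (
	.
	./apps/evm
	./execution/evm
)
```

Go commands run inside the workspace then resolve github.com/evstack/ev-node and its submodules to the local checkout, while downstream consumers of the published apps/evm module are unaffected because go.work is not part of the module.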

Partial component leak on startup failure (pkg/sync/sync_service.go:231-245)

By the time an error in newSyncer or startSubscriber is returned from prepareStart, the store, exchange server, and exchange are already started (from setupP2PInfrastructure). Failure in either of those later steps leaks live goroutines. The fix is a deferred cleanup that runs on error:

func (syncService *SyncService[H]) prepareStart(ctx context.Context) ([]peer.ID, error) {
    peerIDs, err := syncService.setupP2PInfrastructure(ctx)
    if err != nil {
        return nil, ...
    }
    ok := false
    defer func() {
        if !ok { /* stop store, p2pServer, ex */ }
    }()
    // ... newSyncer ...
    ok = true
    return peerIDs, nil
}



Minor

waitForMsgsLanded ignores the caller's context (pkg/raft/node.go:145-163, election.go:93)

The follower startup path calls d.node.waitForMsgsLanded(d.node.Config().SendTimeout), but waitForMsgsLanded uses only a timer internally — it can't be cancelled by the Run context. On shutdown or role changes the follower can hang here for up to SendTimeout. Per repo conventions, long-running waits should propagate context.Context. Adding a ctx parameter (or select on ctx.Done()) would make cancellation reliable.

Genesis fallback in RecoverFromRaft is incomplete (block/internal/syncing/syncer.go:1204-1213)

The fallback state only sets ChainID, InitialHeight, and LastBlockHeight. Normal startup calls initializeState() which also seeds AppHash, DAHeight, LastBlockTime, etc. via exec.InitChain. Executing a recovered block against this partial state could lead to hash mismatches. Reusing the existing genesis initialisation path here would make the two code paths consistent.

--evnode.raft.bootstrap=true still present in example (docs/guides/raft_production.md:98)

The guide text now says --evnode.raft.bootstrap is a "compatibility-only" flag, but the example command still includes it. This contradicts the narrative and may confuse operators. Removing it from the example (or using it only in a dedicated compatibility section) would align docs with the new auto-detection behaviour.

Log file descriptor leak in E2E helper (test/e2e/sut_helper.go:192-195)

os.Create opens a file but logfile is never closed. Over many spawned processes this leaks file descriptors and can delay final log flushes. A t.Cleanup(func() { _ = logfile.Close() }) immediately after the require.NoError would fix it.


Nit / Positive Notes

  • The switch from atomic to mutex in SyncerStatus is the right call — the previous atomic-bool pattern was insufficient for the start-once / stop-if-started semantics and the new startOnce/stopIfStarted pair is clean.
  • Auto-selecting raft startup mode from persisted config in node.go:Start() is a nice UX improvement that removes a common operator footgun.
  • The CHANGELOG.md conflict markers are resolved in commit 53f5ded. ✅
  • Test coverage for SyncerStatus concurrency is appreciated.
  • The TestHeaderSyncServiceStartForPublishing test at sync_service_test.go:63-97 only asserts svc.storeInitialized. It would be stronger to also verify that client2 receives the broadcasted header to catch regressions where publishing succeeds locally but doesn't reach peers.
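The start-once / stop-if-started semantics praised in the first bullet can be sketched as follows; type and method names are illustrative, not the PR's actual SyncerStatus code:

```go
package main

import (
	"fmt"
	"sync"
)

// syncerStatus sketches mutex-guarded start-once / stop-if-started semantics.
type syncerStatus struct {
	mu      sync.Mutex
	started bool
}

// startOnce runs fn only on the first successful call; later calls are no-ops.
// It reports whether fn ran.
func (s *syncerStatus) startOnce(fn func() error) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.started {
		return false, nil
	}
	if err := fn(); err != nil {
		// A failed start leaves the status unset, so a retry is possible —
		// something a bare atomic bool cannot express safely.
		return false, err
	}
	s.started = true
	return true, nil
}

// stopIfStarted runs fn only if a start succeeded earlier, reporting whether
// fn ran. This guards Stop() against never-started services.
func (s *syncerStatus) stopIfStarted(fn func()) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.started {
		return false
	}
	fn()
	s.started = false
	return true
}

func main() {
	var st syncerStatus
	first, _ := st.startOnce(func() error { return nil })
	second, _ := st.startOnce(func() error { return nil })
	stopped := st.stopIfStarted(func() {})
	fmt.Println(first, second, stopped) // prints: true false true
}
```

Holding one mutex across the check and the transition is what makes the pair race-free: a concurrent Start and Stop cannot interleave between reading and writing the flag.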

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 53.84615% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.65%. Comparing base (022b565) to head (a2607e5).

Files with missing lines | Patch % | Lines
node/failover.go | 31.25% | 16 missing, 6 partials ⚠️
pkg/sync/sync_service.go | 47.36% | 13 missing, 7 partials ⚠️
pkg/raft/election.go | 53.12% | 10 missing, 5 partials ⚠️
pkg/sync/syncer_status.go | 90.47% | 1 missing, 1 partial ⚠️
pkg/raft/node.go | 0.00% | 1 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3222      +/-   ##
==========================================
+ Coverage   61.43%   61.65%   +0.21%     
==========================================
  Files         120      120              
  Lines       12504    12588      +84     
==========================================
+ Hits         7682     7761      +79     
+ Misses       3960     3954       -6     
- Partials      862      873      +11     
Flag Coverage Δ
combined 61.65% <53.84%> (+0.21%) ⬆️


@github-actions
Contributor

github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://evstack.github.io/docs-preview/pr-3222/

Built to branch main at 2026-04-02 11:46 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@alpe force-pushed the alex/sync_race branch 2 times, most recently from ec0ffc4 to 59dc917 on April 2, 2026 at 08:51
@alpe force-pushed the alex/sync_race branch from 59dc917 to 74641e0 on April 2, 2026 at 10:06
@alpe changed the title from "fix: Publisher-mode synchronization option for failover scenario" to "[WIP] fix: Publisher-mode synchronization option for failover scenario" on Apr 2, 2026
@alpe marked this pull request as ready for review on April 2, 2026 at 10:28
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test/e2e/sut_helper.go (1)

189-195: ⚠️ Potential issue | 🟡 Minor

Close the per-process logfile handle.

os.Create on Line 192 leaves an open fd that is never closed. E2E suites spawn a lot of processes, so this can leak descriptors and delay flushing the captured logs.

Suggested change
 		logfile, err := os.Create(logfileName)
 		require.NoError(s.t, err)
+		s.t.Cleanup(func() { _ = logfile.Close() })
 		errReader = io.NopCloser(io.TeeReader(errReader, logfile))
 		outReader = io.NopCloser(io.TeeReader(outReader, logfile))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/sut_helper.go` around lines 189 - 195, The per-process logfile
created by os.Create (logfile) in the block that checks s.processLogDir() is
never closed; wrap the logfile so it is closed when the returned readers are
closed by replacing the io.NopCloser(io.TeeReader(..., logfile)) usage with a
ReadCloser that closes the underlying logfile on Close, or otherwise ensure
logfile.Close() is called (e.g., return a combined io.ReadCloser that delegates
Read to the TeeReader and Close to logfile.Close) for both errReader and
outReader so the file descriptor is released and buffered data flushed.
block/internal/syncing/syncer.go (1)

1204-1213: ⚠️ Potential issue | 🔴 Critical

Bootstrap the same genesis state as initializeState().

This fallback only sets ChainID, InitialHeight, and LastBlockHeight, but Line 808 later executes the recovered block against currentState.AppHash. On a fresh node that means raft recovery runs against a different execution baseline than normal startup, which calls exec.InitChain and seeds AppHash, DAHeight, and LastBlockTime. Please reuse that bootstrap path here before applying the raft block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/syncer.go` around lines 1204 - 1213, The fallback
after s.store.GetState should bootstrap the same full genesis state as
initializeState() instead of only setting ChainID/InitialHeight/LastBlockHeight;
update the recovery path that sets currentState and stateBootstrapped to call
the same initialization logic (e.g., run the InitChain/bootstrap sequence used
in initializeState) so that currentState.AppHash, DAHeight, LastBlockTime (and
any other genesis-seeded fields) are populated before the raft/recovered block
is executed; ensure you invoke the same executor/init routine (the InitChain or
genesis bootstrap used by initializeState) rather than manually setting only
those three fields.
pkg/sync/sync_service.go (1)

190-205: ⚠️ Potential issue | 🟠 Major

Unwind partially started components on startup errors.

By the time Line 233 returns from setupP2PInfrastructure, the store, exchange server, and exchange can already be running. If newSyncer fails here — or startSubscriber fails in the new publisher-mode path — these methods return without tearing those pieces down, so a failed start leaks live components into the process.

As per coding guidelines, "Be mindful of goroutine leaks in Go code".

Also applies to: 219-225, 231-245

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service.go` around lines 190 - 205, The startup may leave
store/exchange/exchange-server running if later steps fail (e.g. newSyncer or
startSubscriber); update the startup sequence (functions prepareStart /
setupP2PInfrastructure / initFromP2PWithRetry) to register and invoke cleanup on
error: after setupP2PInfrastructure returns and before calling newSyncer or
startSubscriber, capture the created components (store, exchangeServer, exchange
or any returned tearDown/Close funcs) and ensure they are stopped when a
subsequent error occurs (use a deferred cleanup or explicit teardown call on
error paths in the block around prepareStart -> initFromP2PWithRetry ->
newSyncer/startSubscriber). Reference symbols to change: prepareStart,
setupP2PInfrastructure, initFromP2PWithRetry, newSyncer, and startSubscriber —
ensure each path that returns an error calls the appropriate stop/Close/Shutdown
methods for the started components to avoid goroutine/resource leaks.
🧹 Nitpick comments (1)
pkg/sync/sync_service_test.go (1)

63-97: Assert remote peer delivery, not just local initialization.

This test pays the cost of bringing up client2, but the only postcondition is svc.storeInitialized. A regression where StartForPublishing initializes locally but never broadcasts to peers would still pass. Please add an assertion on peer 2's view of the header so the new publisher-mode path is actually covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service_test.go` around lines 63 - 97, The test currently only
asserts svc.storeInitialized after broadcasting; add an assertion that the
remote peer (client2) actually receives the broadcasted header: before calling
svc.WriteToStoreAndBroadcast, register or subscribe a handler on client2 to
capture incoming P2PSignedHeader messages (using client2's subscribe/handler
API), then after WriteToStoreAndBroadcast use require.Eventually to poll/assert
that the handler received a header equal to signedHeader (or matching
height/DataHash/AppHash). Reference client1/client2, NewHeaderSyncService,
StartForPublishing, WriteToStoreAndBroadcast and svc.storeInitialized when
locating where to add the subscription and the eventual assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: Remove the repo-local replace directives from the module
manifest (the replace block that points github.com/evstack/ev-node and
github.com/evstack/ev-node/execution/evm to relative ../../ paths) so the
published module no longer contains repository-specific local wiring; keep the
go.mod clean for downstream consumers and, if local development linking is
needed, use a go.work file or temporary local replace only in your working copy
before publishing.

In `@block/internal/syncing/syncer.go`:
- Around line 1230-1235: The retry uses the identical currentState so
errInvalidState from AssertValidForNextState will always recur; before retrying
the call to trySyncNextBlockWithState, refresh the snapshot/state (e.g. read
s.GetLastState() or obtain the latest snapshot used by AssertValidForNextState)
and pass that refreshed state into trySyncNextBlockWithState instead of reusing
the original currentState, or recompute/validate the next state after calling
s.SetLastState(currentState) so the second call sees the updated global state.

In `@CHANGELOG.md`:
- Around line 12-21: The changelog contains unresolved git conflict markers
(<<<<<<<, =======, >>>>>>>) — remove those markers and produce a single resolved
section (e.g., keep "### Added") combining both bullet entries into the final
list so both PRs stay: the publisher-mode synchronization line [`#3222`] and the
"Improve execution/evm check..." line [`#3221`]; ensure only one header remains
and that the PR links and bullets are intact and properly formatted.

In `@docs/guides/raft_production.md`:
- Around line 93-100: Update the example to use the canonical aggregator flag
and drop the legacy bootstrap flag: replace the --rollkit.node.aggregator=true
occurrence with --evnode.node.aggregator=true, and remove the
--evnode.raft.bootstrap=true line (only keep raft-related flags like
--evnode.raft.enable, --evnode.raft.node_id, --evnode.raft.raft_addr,
--evnode.raft.raft_dir, and --evnode.raft.peers). Ensure the rest of the example
(e.g., --rollkit.p2p.listen_address) stays unchanged unless also intended to be
canonicalized.

In `@node/failover.go`:
- Around line 193-209: The gating logic ignores storeHeight when deciding to
start in publisher mode; change the condition that currently checks headerHeight
and dataHeight to include storeHeight so publishing only starts when the main
block store and both sync stores are empty. In other words, adjust the boolean
check around headerSyncService.Store().Height() and
dataSyncService.Store().Height() (the if that returns false) to also consider
the local storeHeight variable so StartForPublishing is only taken when
storeHeight == 0 && headerHeight == 0 && dataHeight == 0.

In `@pkg/raft/election.go`:
- Around line 93-96: The wait here uses d.node.waitForMsgsLanded which
internally uses context.Background(), so follower startup isn’t cancellable;
update this call to use the Run cancellation context (e.g., pass the existing
ctx) and add a ctx-aware variant on the node (e.g., waitForMsgsLandedCtx(ctx,
timeout) or change waitForMsgsLanded to accept a context) so the call in
election.go uses that context and honors cancellation/role changes; modify the
node implementation (pkg/raft/node.go – waitForMsgsLanded) to accept and
propagate the provided context instead of creating context.Background().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b5476c3-9ec8-4d67-a4cb-fcde2300edbd

📥 Commits

Reviewing files that changed from the base of the PR and between 022b565 and 74641e0.

⛔ Files ignored due to path filters (1)
  • apps/evm/go.sum is excluded by !**/*.sum
📒 Files selected for processing (19)
  • CHANGELOG.md
  • apps/evm/go.mod
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • docs/guides/raft_production.md
  • docs/learn/config.md
  • node/failover.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/raft/election.go
  • pkg/raft/election_test.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
  • pkg/sync/sync_service.go
  • pkg/sync/sync_service_test.go
  • pkg/sync/syncer_status.go
  • pkg/sync/syncer_status_test.go
  • test/e2e/failover_e2e_test.go
  • test/e2e/sut_helper.go

Comment on lines +5 to +8
replace (
github.com/evstack/ev-node => ../../
github.com/evstack/ev-node/execution/evm => ../../execution/evm
)
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check for go.work file (monorepo pattern)
fd -H 'go.work' .

# Explore repository structure
ls -la apps/ 2>/dev/null || echo "apps directory structure"
find . -name 'go.mod' -type f | head -20

Repository: evstack/ev-node

Length of output: 631


🏁 Script executed:

# Check the complete go.mod for apps/evm
cat -n apps/evm/go.mod

Repository: evstack/ev-node

Length of output: 14568


🏁 Script executed:

# Look for CI/CD configuration to understand build patterns
find . -name '.github' -type d
find . -type f \( -name '*.yml' -o -name '*.yaml' \) -path '*/.github/workflows/*' | head -10

Repository: evstack/ev-node

Length of output: 434


🏁 Script executed:

# Check if there are any docs or comments about module publishing
grep -r "apps/evm" --include='*.md' --include='*.txt' . 2>/dev/null | head -10

Repository: evstack/ev-node

Length of output: 579


Remove repo-local replace directives from the published module manifest.

These relative paths only resolve within this repository layout and break downstream builds when github.com/evstack/ev-node/apps/evm is used as a dependency. The module is published with independent versioning (per RELEASE.md); local wiring should use the existing go.work pattern or be removed before merge.

Suggested adjustment
-replace (
-	github.com/evstack/ev-node => ../../
-	github.com/evstack/ev-node/execution/evm => ../../execution/evm
-)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
replace (
github.com/evstack/ev-node => ../../
github.com/evstack/ev-node/execution/evm => ../../execution/evm
)

Comment on lines +1230 to +1235
err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
s.logger.Debug().Err(err).Msg("raft recovery failed after bootstrap state init, retrying once")
// Keep strict validation semantics; this retry only guards startup ordering races.
s.SetLastState(currentState)
err = s.trySyncNextBlockWithState(ctx, event, currentState)
Contributor


⚠️ Potential issue | 🟠 Major

The retry cannot clear errInvalidState.

Lines 719-727 validate against the currentState argument, and Line 1235 passes the same snapshot again. s.SetLastState(currentState) only updates the global pointer, so an errInvalidState from AssertValidForNextState will reproduce deterministically on the retry.


Comment on lines +93 to +96
if err = d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
// this wait can legitimately time out
d.logger.Debug().Err(err).Msg("timed out waiting for raft messages before follower verification; continuing")
}
Contributor
⚠️ Potential issue | 🟡 Minor

Keep follower startup cancellable.

This new wait path ignores Run cancellation because pkg/raft/node.go:145-175 implements waitForMsgsLanded with its own context.Background() timeout. On shutdown or role changes, follower startup can now sit here for up to SendTimeout. A ctx-aware variant would keep this path responsive. As per coding guidelines, "Use context.Context for cancellation in Go".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/raft/election.go` around lines 93 - 96, The wait here uses
d.node.waitForMsgsLanded which internally uses context.Background(), so follower
startup isn’t cancellable; update this call to use the Run cancellation context
(e.g., pass the existing ctx) and add a ctx-aware variant on the node (e.g.,
waitForMsgsLandedCtx(ctx, timeout) or change waitForMsgsLanded to accept a
context) so the call in election.go uses that context and honors
cancellation/role changes; modify the node implementation (pkg/raft/node.go –
waitForMsgsLanded) to accept and propagate the provided context instead of
creating context.Background().

@alpe alpe changed the title [WIP]fix: Publisher-mode synchronization option for failover scenario fix: Publisher-mode synchronization option for failover scenario Apr 2, 2026
@alpe alpe requested a review from julienrbrt April 2, 2026 11:42
Contributor

@coderabbitai coderabbitai bot left a comment
♻️ Duplicate comments (1)
docs/guides/raft_production.md (1)

93-100: ⚠️ Potential issue | 🟡 Minor

Remove the bootstrap flag from the example command.

The documentation now describes --evnode.raft.bootstrap as a compatibility-only flag (line 36) and states that "startup mode is selected automatically from persisted raft configuration state" (lines 60-62). However, the example still includes --evnode.raft.bootstrap=true on line 98, making the configuration appear more complex than necessary and contradicting the "optional" guidance.

For the primary/happy-path example, omit the bootstrap flag to demonstrate the recommended automatic mode detection.

✂️ Proposed fix
   --evnode.raft.enable=true \
   --evnode.raft.node_id="node-1" \
   --evnode.raft.raft_addr="0.0.0.0:5001" \
   --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
-  --evnode.raft.bootstrap=true \
   --evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
   --evnode.p2p.listen_address="/ip4/0.0.0.0/tcp/26656" \

As per coding guidelines, docs/guides/**/*.md: Structure guides with clear step-by-step instructions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` around lines 93 - 100, The example command
still includes the compatibility-only flag --evnode.raft.bootstrap=true which
contradicts the guide text that startup mode is selected automatically; remove
the --evnode.raft.bootstrap option from the example lines (the stanza containing
--evnode.raft.enable, --evnode.raft.node_id, --evnode.raft.raft_addr,
--evnode.raft.raft_dir, --evnode.raft.peers, and --evnode.p2p.listen_address) so
the primary/happy-path example demonstrates automatic mode detection without the
bootstrap flag.
🧹 Nitpick comments (2)
docs/guides/raft_production.md (2)

58-67: Fix markdown formatting issues.

Static analysis flagged several markdown formatting violations in this section:

  • Line 58: Missing blank line above the heading
  • Lines 60-66: Inconsistent spacing after list markers (3 spaces instead of 1)

These don't affect the rendered output in most viewers but should be fixed for consistency and linter compliance.

🔧 Proposed formatting fixes
 
 ### 1. Static Peering & Automatic Startup Mode
 Use static peering with automatic mode selection from local raft configuration:
-*   If local raft configuration already exists in `--evnode.raft.raft_dir`, the node starts in rejoin mode.
-*   If no local raft configuration exists yet, the node bootstraps from configured peers.
-*   `--evnode.raft.bootstrap` is retained for compatibility but does not control mode selection.
-*   **All configured cluster members** should list the full set of peers in `--evnode.raft.peers`.
+* If local raft configuration already exists in `--evnode.raft.raft_dir`, the node starts in rejoin mode.
+* If no local raft configuration exists yet, the node bootstraps from configured peers.
+* `--evnode.raft.bootstrap` is retained for compatibility but does not control mode selection.
+* **All configured cluster members** should list the full set of peers in `--evnode.raft.peers`.
 *   The `peers` list format is strict: `NodeID@Host:Port`.
-*   **Limitation**: Dynamic addition of peers (run-time membership changes) via RPC/CLI is not currently exposed.
-*   **Not supported**: Joining an existing cluster as a brand-new node that was not part of the initial static membership.
+* **Limitation**: Dynamic addition of peers (run-time membership changes) via RPC/CLI is not currently exposed.
+* **Not supported**: Joining an existing cluster as a brand-new node that was not part of the initial static membership.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` around lines 58 - 67, Add a blank line before
the "### 1. Static Peering & Automatic Startup Mode" heading and normalize the
list bullets to use a single space after each marker and consistent single-space
indentation; update the list items referencing flags like
--evnode.raft.raft_dir, --evnode.raft.bootstrap, --evnode.raft.peers and the
peers format `NodeID@Host:Port` so each bullet begins with "- " (dash + one
space) and remove extra spaces after markers to satisfy markdown linter
expectations.

36-36: Clarify what "compatibility flag" means.

The description states the bootstrap flag is for "compatibility" but doesn't explain with what or whether it has any effect. Consider being explicit: does the flag still influence behavior, or is it completely ignored? If it's retained only for backward config-file compatibility (parsed but unused), state that clearly.

📝 Suggested clarification
-| `--evnode.raft.bootstrap` | `raft.bootstrap` | Compatibility flag. Startup mode is selected automatically from persisted raft configuration state. | optional |
+| `--evnode.raft.bootstrap` | `raft.bootstrap` | Legacy flag (parsed but unused). Startup mode is selected automatically from persisted raft configuration state. | omit |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` at line 36, The table entry for
`--evnode.raft.bootstrap` / `raft.bootstrap` is vague about "compatibility
flag"; update the description to explicitly state its current behavior (e.g., if
it is retained only for backward compatibility and parsed but ignored at
runtime, change the text to "Deprecated compatibility flag — parsed for backward
compatibility but ignored; startup mode is selected automatically from persisted
Raft configuration state"; if it still affects startup behavior, state that and
describe the exact effect). Make the change to the table cell containing
`--evnode.raft.bootstrap` / `raft.bootstrap` so readers know whether the flag is
effective or merely retained for legacy config parsing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b93e6a52-29c2-4bb1-a32e-10cacf79f60b

📥 Commits

Reviewing files that changed from the base of the PR and between 53f5ded and a2607e5.

📒 Files selected for processing (1)
  • docs/guides/raft_production.md

Source: "",
})
}
err := s.trySyncNextBlockWithState(ctx, event, currentState)
Member
this change is effectively to use the correct state when we override it at L1207, right?

Contributor Author
yes, before it was re-read in the method and could be empty

func (syncService *SyncService[H]) startSyncer(ctx context.Context) error {
	if syncService.syncerStatus.isStarted() {
		return nil
// StartForPublishing starts the sync service in publisher mode.
Member
Trying to wrap my head around this. Why is it exactly needed? A node can fetch block 1 over DA and p2p 🤔.
Or is it because the connection happens before DA fetching, effectively deadlocking it?

Contributor Author

@alpe alpe Apr 2, 2026
There was a race that could lead to a deadlock:

  • P2P init waits for block 1 from peers.
  • Block 1 production waits for P2P init to finish.
